EpiDiP/NanoDiP: a versatile unsupervised machine learning edge computing platform for epigenomic tumour diagnostics

DNA methylation analysis based on supervised machine learning algorithms with static reference data, allowing diagnostic tumour typing with unprecedented precision, has quickly become a new standard of care. Whereas genome-wide diagnostic methylation profiling is mostly performed on microarrays, an increasing number of institutions additionally employ nanopore sequencing as a faster alternative. In addition, methylation-specific parallel sequencing can generate methylation and genomic copy number data. Given these diverse approaches to methylation profiling, to date, there is no single tool that allows (1) classification and interpretation of microarray, nanopore and parallel sequencing data, (2) direct control of nanopore sequencers, and (3) the integration of microarray-based methylation reference data. Furthermore, no software capable of entirely running in routine diagnostic laboratory environments lacking high-performance computing and network infrastructure exists. To overcome these shortcomings, we present EpiDiP/NanoDiP as an open-source DNA methylation and copy number profiling suite, which has been benchmarked against an established supervised machine learning approach using in-house routine diagnostics data obtained between 2019 and 2021. Running locally on portable, cost- and energy-saving system-on-chip as well as gpGPU-augmented edge computing devices, NanoDiP works in offline mode, ensuring data privacy. It does not require the rigid training data annotation of supervised approaches. Furthermore, NanoDiP is the core of our public, free-of-charge EpiDiP web service which enables comparative methylation data analysis against an extensive reference data collection. We envision this versatile platform as a useful resource not only for neuropathologists and surgical pathologists but also for the tumour epigenetics research community. 
In daily diagnostic routine, analysis of native, unfixed biopsies by NanoDiP delivers molecular tumour classification in an intraoperative time frame.

Supplementary Information: The online version contains supplementary material available at 10.1186/s40478-024-01759-2.


Sample Preservation for Nanopore Sequencing
Cytology-preserved tumour biopsies (SurePath®, Becton Dickinson, USA; ThinPrep®, Hologic Inc., USA) were used to analyse tumours from referring centres outside the Basel area. DNA extraction from cytology-preserved tissue scrolls or effusion cell pellets was performed after washing with tap water. Native biopsies submitted in cytology preservatives were routinely analysed after 1-4 days, corresponding to postal shipment times. For comparative tests, we also stored native tissue in cytology preservatives for up to 21 days prior to analysis.
Samples were kept at ambient temperatures; no temperature logging during shipment by regular mail was performed.

Laboratory Process Timeline of NanoDiP
Examining a specimen with NanoDiP requires several hands-on interventions, which are detailed in a flow chart.
The diagnostic workflow can realistically be completed within 1 h 30 min, in particular if a fresh sequencing flow cell is used for an urgent specimen.

Local Laboratory Hardware
The code of NanoDiP is open-source and meant to be adjusted to individual laboratory needs (integrated Jupyter Notebook). Hardware and software requirements, including hardware costs, are detailed in Suppl. File 1. Briefly, NanoDiP is designed to run on aarch64 CPU/GPU hybrid SoCs and gpGPU-equipped x86_64 computers running Linux (Xubuntu 18.04, 20.04), particularly including cost- and energy-efficient CCM hardware. Alternatively, NanoDiP itself runs on systems lacking a gpGPU, e.g., virtual machines and high-performance CPU-only compute clusters, with the only limitation that third-party software requiring a GPU (at the time of writing, some nanopore basecallers) can no longer be integrated. The graphical NanoDiP user interface is based on the minimalist web framework CherryPy. Recommended hardware includes >= 8 CPU threads, >= 512 GPU cores, >= 32 GB RAM, and >= 1 TB NVMe storage. For CPU/GPU hybrid SoCs, 32 GB shared RAM is sufficient; for PCIe-connected gpGPUs, >= 8 GB GPU RAM is suitable.
In our hands, SoCs enable reliable daily routine diagnostics with uptimes of well over 1 year. On SoCs, such as the Nvidia Jetson AGX Xavier 32GB, NanoDiP can control up to three Mk1B sequencers in parallel for GPU-based base and methylation calling, complemented by CPU-based UMAP and copy number analysis (Figure 3). As mentioned, such SoC systems are intended to run without an internet connection and consume approx. 50 W for the SoC including a storage device. They have a small spatial footprint (Figure 8A in main text).

Public EpiDiP Webserver
We

Hardware Details
The term "edge computing" summarises hardware/software systems at the "network edge", i.e. smaller-scale computers which can be embedded in laboratory equipment such as sequencers, as opposed to high-performance compute clusters requiring dedicated infrastructure rooms (cooling, high-power electricity). We have chosen GPU-augmented edge computers that are widely available at an affordable cost (~EUR 2000). Two major processor platforms, x86_64 and ARMv8, have been evaluated. As an alternative to classical PCs, we have set up several NanoDiP systems with cryptocurrency mining mainboards (e.g., H510 PRO BTC+, Asrock), both on so-called mining rigs (open frames that hold GPUs) and in rack-mountable cases. Mining rigs have the advantage of avoiding physical constraints when multiple GPUs are to be mounted, both in terms of dimensions and of power supply connectors. In line with their intended purpose, mining mainboards are designed to consume as little power as possible while maximising the PCIe connectivity that we use for gpGPU and NVMe.
More recently, we have evaluated the R10 pore, which required significant adaptations of the software portions responsible for obtaining and processing nanopore data, which previously worked with the R9 pore. We now support the widely available ORIN AGX Developer Kit SoC (Nvidia, USA), which is the successor platform of the AGX Xavier. The latter can equally be employed and is currently sold by third-party industry suppliers, including a 64GB RAM version. We have successfully tested two such 64GB AGX Xavier systems (dsboard from forecr.io, Turkey, and from Auvidea, Germany). The table below lists hardware including costs for those computers that have been used to perform our retrospective UMAP analysis (Mac Pro A1289) as well as our routine diagnostic systems (AGX Xavier in Basel and Münster, ZBOX in Frankfurt).

Software Outline
We chose Ubuntu 18.04 as the operating system for development and production. With the implementation of the R10 pore by ONT, we have switched to Ubuntu 20.04. Closed-source code within our setup is limited to software provided free of charge along with consumables by Oxford Nanopore Technologies (ONT) and the NVIDIA software portions for gpGPU utilisation. The ONT software handles sequencer control and basecalling. The aarch64 (ARM) platform requires an in-place compilation of R and all dependencies. We provide in-place compilation scripts for aarch64 and x86_64 that enable building NanoDiP from source on the target computer. All Python dependencies are installed with pip (R9). The development branch NanoDiP version is supplied as a VirtualBox™ VM.
Jupyter Notebook is the integrated development environment in which we developed our software. Python and shell (bash) knowledge is sufficient to adapt our software to custom needs. NanoDiP is a CherryPy-based web application with a minimalistic, lightweight graphical user interface to initiate nanopore sequencing and launch data analysis during or after a run. Microarray data processing is controlled through the same user interface. Since NanoDiP also represents the core of our public EpiDiP [1][2][3] web service for dimension reduction plots of methylation microarray data, a local NanoDiP installation enables users to create, curate, and annotate their own reference case collections, eliminating the need to upload their array data outside their institution. All nanopore-based analyses are computed locally. As part of the NanoDiP developer mode through Jupyter Notebook, we provide access to the MinKNOW (R9) playback mode: previously recorded runs can be recapitulated indefinitely from raw sequencing data, saving on reagents and flow cells during software tests. For productive setups, the Python code exported from Jupyter Notebook is executed as a server application.
The graphical user interface was adapted to the needs of medical-technical staff regarding sequencing run control and of (neuro)pathologists for data analysis and interpretation. Interactive HTML plots (Plotly) provide the (neuro)pathologist with diagnostic information. In addition, static reports in PDF format can be generated for import into compulsory medical documentation systems and clinical reporting. The PDF reporting functionality is also part of the EpiDiP web service. Connection to laboratory information systems is facilitated by the open source code of NanoDiP, the export of PDF reports, and the possibility of using barcode scanners to register sample IDs. NanoDiP supports headless operation from the command line without reconfiguration (e.g., for benchmarking and research, or to generate and pull reports from the public EpiDiP server in an automated manner). Although we recommend running NanoDiP with GPU support for increased processing speed, we developed NanoDiP GPU-agnostically so that it also runs on CPU-only x86_64 or ARMv8 systems, including virtual machines. High-performance computing platforms which compensate for the lack of a GPU by allocating many CPU nodes and RAM can also run NanoDiP for data analysis (for R9 data). However, since the Dorado basecaller, unlike the prior Guppy, no longer offers CPU-only operation, R10 nanopore basecalling requires a GPU.
Performance was optimised by using random access binary files residing on PCIe-attached NVMe memory for microarray reference CpG data and by making use of gpGPUs for parallel computing tasks. A scheduler avoids hardware overprovisioning.
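The idea of a random access binary reference file can be illustrated with a minimal sketch. The file layout used here (one fixed-length row of little-endian float32 beta values per reference case) and the function names `write_reference` and `read_row_subset` are assumptions made for illustration only; they are not taken from the NanoDiP source code.

```python
import os
import struct
import tempfile

FLOAT_SIZE = 4  # bytes per little-endian float32 value (assumed layout)


def write_reference(path, rows):
    """Write equal-length rows of beta values as consecutive float32 records."""
    with open(path, "wb") as fh:
        for row in rows:
            fh.write(struct.pack(f"<{len(row)}f", *row))


def read_row_subset(path, row_index, n_cols, col_indices):
    """Read only the requested CpG columns of one row by seeking to each value."""
    values = []
    with open(path, "rb") as fh:
        for col in col_indices:
            # Byte offset of value (row_index, col) in the flat file.
            fh.seek((row_index * n_cols + col) * FLOAT_SIZE)
            values.append(struct.unpack("<f", fh.read(FLOAT_SIZE))[0])
    return values


# Toy demonstration: two reference cases with three CpG sites each.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "reference.bin")
    write_reference(demo_path, [[0.0, 0.25, 0.5], [0.75, 1.0, 0.125]])
    demo_values = read_row_subset(demo_path, row_index=1, n_cols=3, col_indices=[0, 2])
```

Because every value sits at a computable byte offset, only the CpG sites overlapping with a given nanopore run need to be read, which is where fast NVMe storage pays off.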

UMAP Plotting Outline
UMAP plotting occurs after standard deviation-based methylation site selection. This is accelerated by utilising a gpGPU. The figure below illustrates the data flow on the public server as well as in the local instance of NanoDiP. In the absence of a CUDA-compatible gpGPU, the system uses numpy instead of cupy functions, i.e. it is possible to run the program without modification, although at a lower speed.
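The numpy/cupy fallback and the standard deviation-based site selection can be sketched as follows. This is a minimal illustration rather than the NanoDiP implementation: the function name `top_variable_sites`, the `ON_GPU` flag, and the choice to return the selected indices in ascending order are assumptions.

```python
import numpy as np

try:
    import cupy as xp  # GPU path; requires a CUDA-compatible gpGPU
    xp.cuda.runtime.getDeviceCount()  # raises if no usable device is present
    ON_GPU = True
except Exception:
    xp = np  # CPU fallback: same array API, lower speed
    ON_GPU = False


def top_variable_sites(betas, n_sites):
    """Return the indices of the n_sites columns (CpG sites) with the highest
    standard deviation across rows (cases), sorted in ascending order."""
    arr = xp.asarray(betas, dtype=xp.float32)
    sd = xp.std(arr, axis=0)
    order = xp.argsort(sd)[::-1][:n_sites]  # most variable sites first
    idx = xp.sort(order)
    return xp.asnumpy(idx) if ON_GPU else idx


# Toy matrix: 3 cases x 3 CpG sites; the first site does not vary at all.
toy_betas = [[0.1, 0.5, 0.9],
             [0.1, 0.6, 0.1],
             [0.1, 0.4, 0.5]]
selected = top_variable_sites(toy_betas, 2)
```

The columns selected this way would then be passed on to the UMAP step; that step is not shown here.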

GPUMAP Limitations and Alternatives
An instance of gpumap has been available as a service since 2019 and, at the time of writing this manuscript, still is (legacy page link on www.epidip.org). The gpumap package has not been maintained by its author since 2019 and has only been tested in CUDA 8, 9, and 10 environments. It depends on faiss (https://pypi.org/project/faiss-gpu/), which is not available for ARM platforms. The current CUDA version 11 does not work with gpumap. As a consequence, we have restricted the current versions of EpiDiP and NanoDiP to the Python CPU implementation of UMAP (umap-learn) and have instead accelerated standard deviation computation using cupy. Future NanoDiP versions will again be able to utilise auxiliary gpGPUs by switching to the RAPIDS AI version of the UMAP Python library (Nvidia, Inc.). This option is particularly attractive for cryptocurrency mining hardware hosting multiple GPUs.

Edge Computing
Following the general idea that sharing of sequencing data through various networks including the public internet is problematic in a medical care setting and that computational infrastructures may be limited, we have sought to implement NanoDiP on an affordable long-term support industrial SoC in addition to conventional x86_64 PC hardware, particularly cryptocurrency miners that work well with consumer-grade "gaming" gpGPUs. The term "edge computing" refers to bringing the data evaluation to the network "edge", i.e. to processing all data right at the place where it is obtained. With regard to laboratory accreditation, but also for software development, well-defined compute platforms that include both CPU and GPU features were chosen. This enables NanoDiP to be provided as a "one-stop shop" and almost "one button" solution so that (neuro)pathologists and medical technical staff are not required to configure the computer platform themselves. Rather, NanoDiP can be distributed in a preconfigured manner. At the same time, the open design of NanoDiP allows constant addition of reference data and allows the pan-cancer functionality of EpiDiP to be run with a laboratory footprint of several square centimetres. The system is intended to run in an offline mode for maximum data security and to prevent changes by (unintended) software updates. In our diagnostic routine, we operate our NanoDiP devices behind a hardware-based firewall, allowing only specific IP ranges to communicate with our laboratory information management systems on the local network, reference databases, as well as backup and local monitoring systems. SoCs including breakout boards are at the time of writing priced at about EUR 2000 and include all hardware to which the sequencing devices (Mk1B and P2 Solo, ONT, UK) connect via USB ports. Similar (or even lower) pricing applies to all parts needed to construct a NanoDiP computer from cryptocurrency mining hardware. If needed, NanoDiP also runs in the absence of a sequencer when aimed at data evaluation only. Overall, the proposed hardware platforms are affordable to low-income regions, which is relevant given that the current WHO CNS Tumour classification [4] lists methylation analysis among the "desirable techniques".

User Interface
NanoDiP features a graphical frontend for nanopore sequencer control and data analysis.The user interface (UI) is operated through a local web browser.
The UI displays in-depth information on attached nanopore sequencers (Mk1B and P2 Solo). Idle devices are shown with a white background. Active devices are displayed with a green background, with adjacent UMAP and CNV plots calculated from the data acquired so far. The UI can launch ("Start seq") and stop the sequencing process ("Mk1b Status", "terminate manually" in the field describing each attached sequencer). It can automatically terminate runs upon the acquisition of sufficient data ("Click this link to launch automatic run terminator …"). Already during data acquisition, users may compare the preliminary epigenetic data to reference cohorts of choice and generate copy number plots. Analyses are initiated through "Analyze" and are possible during or after a run on all data present on the NanoDiP device. More connected sequencing devices can be accessed by scrolling down (screenshot above, bottom clipped).
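The logic of an automatic run terminator can be sketched as a simple polling loop. This is a hypothetical illustration, not the NanoDiP code: the function name, its parameters, and the two callables `get_called_bases` and `stop_run` are assumptions that stand in for whatever sequencer interface is actually used, and the target yield is left to the caller.

```python
import time


def terminate_when_sufficient(get_called_bases, stop_run, target_bases,
                              poll_seconds=60.0, max_polls=None):
    """Poll the current yield and stop the run once target_bases is reached.

    get_called_bases: callable returning the number of bases acquired so far.
    stop_run: callable that ends the sequencing run.
    Returns True if the run was stopped, False if max_polls was exhausted.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        if get_called_bases() >= target_bases:
            stop_run()
            return True
        polls += 1
        time.sleep(poll_seconds)
    return False
```

Passing the yield query and the stop action in as callables keeps the sketch independent of any particular sequencer API and makes it testable with recorded or simulated runs.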
Nanopore (ONT) and Infinium Methylation microarrays (Illumina), both in conjunction with DNA extraction (Qiagen), sum up to approx. EUR 190 per sample and analysis (sequencing or array). This holds true if 150 megabases of DNA are sequenced per sample and at least 6 runs are performed per MinION sequencing flow cell with the SQK-RBK004 / RAP Top-up kits. The cost excludes the FFPE restoration kit (Illumina), which is not required for natively extracted DNA and typically not necessary for fresh paraffin blocks.

The sequencer is interfaced through the MinKNOW API (https://github.com/nanoporetech/minknow_api), a Python API to control the sequencer and (to some extent) live basecalling in parallel to the so-called MinKNOW UI application. The MinKNOW UI is a general-purpose, technically limited user interface for the sequencing device. The ONT-supplied basecallers Guppy and (at the time of writing) Dorado use supervised machine learning models that run significantly faster with GPU acceleration than in CPU mode. Therefore, the GPU implementations of Guppy were used throughout the project. For R10 support, the current beta version of NanoDiP uses Dorado instead of Guppy for basecalling. This change, enforced by ONT, now requires a gpGPU. The MinKNOW API and portions of MinKNOW are written in Python 3.7 (binary supplied by ONT), hence all development focused on this version of Python. Notably, we have been unable to run the MinKNOW API in Python 3.8. Guppy and MinKNOW binaries are provided by ONT for Ubuntu 16.04-, 18.04-, and 20.04-based computers with x86_64 and ARM processors. The ARM implementation has been adapted from the MinIT and, more recently, the Mk1C device distributions (ONT). Both the MinIT and Mk1C devices contain a predecessor of the Nvidia Jetson AGX Xavier CPU/GPU hybrid SoC. Nanopolish and f5c were compiled from source. We made tested, pinned versions (supporting R9) available through our GitHub repository. Microarray data import requires a multitude of R packages, in particular minfi and conumee. To ensure reproducibility, we pinned R to version 4.1.1 and Bioconductor to version 3.14 for 450K/850K arrays (detailed installation script in the NanoDiP repository, https://github.com/neuropathbasel/nanodip). 935K (EPIC v2) arrays are supported in the current development branch (https://github.com/neuropathbasel/nanodip_dev) through adapted versions of minfi and conumee (links below).